Introduction

This is the fourth project of Udacity’s Data Analyst Nanodegree program. We were given several data sources to choose from. I chose Arizona’s 2016 presidential campaign finance data from the Federal Election Commission (FEC) website.

The analysis follows this format:

1- I will declare my intentions with a hypothesis (if applicable).

2- Insert an R snippet and run it.

3- Declare my findings.

And so on…

First, I will begin the analysis by exploring basic statistics about the data set. This will help me see the nature of the data and whether it needs cleaning or wrangling. Afterwards, I will explore relationships between two or more variables using the methods I learned in chapter 4, such as scatter plots, line plots, box plots, and histograms. This is the basic outline of the analysis, but I expect to find interesting things to talk about along the way.

Data Preparation and Munging

The structure of the data is as follows. The file has 19 variables, and these are the most important ones for the analysis:

I wish there were party and gender columns, so I will try to add them below.

# Add party col
# Note code template was taken from Udacity Forums

# Names must match az$cand_nm exactly; unmatched candidates keep party = NA
index <- c("Johnson, Gary", "Stein, Jill", "McMullin, Evan")
dindex <- c("Clinton, Hillary Rodham", "Sanders, Bernard", "Lessig, Lawrence",
            "O'Malley, Martin Joseph", "Webb, James Henry Jr.")
rindex <- c("Bush, Jeb", "Carson, Benjamin S.",
            "Christie, Christopher J.", "Cruz, Rafael Edward 'Ted'",
            "Fiorina, Carly", "Gilmore, James S III",
            "Graham, Lindsey O.", "Huckabee, Mike",
            "Jindal, Bobby", "Kasich, John R.",
            "Paul, Rand", "Perry, James R. (Rick)",
            "Rubio, Marco", "Santorum, Richard J.",
            "Trump, Donald J.", "Walker, Scott")

# Index the column explicitly instead of using attach()/detach()
az$party <- NA_character_
az$party[az$cand_nm %in% index] <- "independent"
az$party[az$cand_nm %in% dindex] <- "democrat"
az$party[az$cand_nm %in% rindex] <- "republican"

# Convert party to factor
az$party <- factor(az$party)

I would also like to add other information, such as latitudes and longitudes, for map analysis.
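One way to attach coordinates is to trim each ZIP code to five digits and merge against a ZIP lookup table. The sketch below is only an illustration under assumptions: `zip_lookup`, its column names, and its coordinates are made-up placeholders, not the table used in this project.

```r
# Toy stand-in for the contributions data; contbr_zip mixes 9-digit and 5-digit codes
az_toy <- data.frame(
  contbr_zip = c("857507118", "85253", "852043820"),
  contb_receipt_amt = c(25, 50, 100),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table (coordinates are illustrative only)
zip_lookup <- data.frame(
  zip = c("85750", "85253"),
  latitude = c(32.30, 33.55),
  longitude = c(-110.84, -111.96),
  stringsAsFactors = FALSE
)

# Keep only the first five digits so ZIP+4 codes match the lookup
az_toy$clean_zip <- substr(az_toy$contbr_zip, 1, 5)

# Left join: all.x = TRUE keeps contributions whose ZIP is missing from the lookup
az_geo <- merge(az_toy, zip_lookup, by.x = "clean_zip", by.y = "zip", all.x = TRUE)
```

Rows whose ZIP code is not in the lookup keep `NA` coordinates, so they can be counted or dropped before mapping.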

I would also like to integrate population data by zip-code from the 2010 ZCTA census.

Now that I have added the candidates’ genders, I’ll add the contributors’ genders using the gender package.
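Predicting a contributor's gender needs a first name, which has to be pulled out of the "LAST, FIRST" strings in contbr_nm. The sketch below shows one possible extraction. The `gender_lookup` vector is a hypothetical stand-in for the name-to-gender prediction so the example runs without extra packages, and "DOE, JOHN Q." is a made-up name.

```r
# Contributor names follow the "LAST, FIRST [MIDDLE]" pattern seen in contbr_nm
contbr_nm <- c("BORCH, INGER", "FRANZ, ROBIN", "DOE, JOHN Q.")

# Drop everything up to the comma, then keep the first word that follows
after_comma <- sub("^[^,]*, *", "", contbr_nm)
first_name <- sub(" .*$", "", after_comma)

# Hypothetical lookup standing in for a name-to-gender prediction
gender_lookup <- c(INGER = "female", ROBIN = "female", JOHN = "male")
contrib_gender <- unname(gender_lookup[first_name])
```

Names that the lookup does not know come back as `NA`, which makes it easy to see how many contributors could not be classified.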

Exploratory Data Analysis

I would like to know how the data is distributed.

Date distribution

It appears that we have negative amounts, going all the way down to -5400. I believe these represent refunds, judging by the most common receipt descriptions.

I want to look at the amount statistics without the refunds.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.04    15.00    27.00    80.21    61.86 10800.00
##  'table' int [1:2224(1d)] 1 1 2 2 1 25 1 861 1 1 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ non_zero_rec: chr [1:2224] "0.04" "0.12" "0.24" "0.5" ...
## non_zero_rec
##     3     5     8    10    15    19    20    25    27    28    35    38 
##  1126  7727  1869 11486  5699  2539  2638 17543  5432  2258  2322  1014 
##    40    50    75    80   100   200   250   500 
##  2340 13346  1325  2038 11547  1941  4095  1380
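A summary and frequency table like the ones above could be produced roughly as follows. This is a sketch on toy amounts, and the `amt > 0` filter is an assumption about how `non_zero_rec` was built.

```r
# Toy amounts: one refund (negative), one zero, and several receipts
amt <- c(-40, 0, 0.04, 15, 25, 25, 27, 50, 50, 50, 100)

# Keep only positive receipts; this drops refunds and zero-dollar rows
non_zero_rec <- amt[amt > 0]
summary(non_zero_rec)

# Count how often each amount occurs, most frequent first
amount_counts <- sort(table(non_zero_rec), decreasing = TRUE)
head(amount_counts, 3)
```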

I wonder why unusual amounts such as 19, 27, or even 38 are so common.

Below we will see the number of contributions for each candidate and how they break out:

##   Clinton, Hillary Rodham          Sanders, Bernard 
##                     53861                     35784 
##          Trump, Donald J. Cruz, Rafael Edward 'Ted' 
##                     16087                      7129 
##       Carson, Benjamin S.              Rubio, Marco 
##                      2954                      1657 
##            Fiorina, Carly                Paul, Rand 
##                       462                       426 
##             Johnson, Gary           Kasich, John R. 
##                       318                       263 
##               Stein, Jill                 Bush, Jeb 
##                       199                       122 
##            Huckabee, Mike            McMullin, Evan 
##                       101                        98 
##             Walker, Scott   O'Malley, Martin Joseph 
##                        95                        29 
##  Christie, Christopher J.      Santorum, Richard J. 
##                        19                        19 
##             Jindal, Bobby        Graham, Lindsey O. 
##                        10                         9 
##     Webb, James Henry Jr.          Lessig, Lawrence 
##                         5                         4 
##    Perry, James R. (Rick)      Gilmore, James S III 
##                         1                         0

I would like to see a box plot of contribution amounts by gender and party. Republicans contributed more on average, and their contribution amounts had a wider range. Male Republicans contributed slightly more on average than their female counterparts.
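The comparison behind a box plot like this can be sketched in base R. The data frame below is a toy example with invented amounts, and the column names `amt`, `party`, and `gender` are assumptions for illustration.

```r
# Toy contributions; the real data has many more rows per group
toy <- data.frame(
  amt    = c(10, 20, 30, 15, 25, 35, 40, 80, 120, 50, 100, 150),
  party  = rep(c("democrat", "republican"), each = 6),
  gender = rep(rep(c("female", "male"), each = 3), times = 2)
)

# Mean contribution for every party/gender combination
group_means <- tapply(toy$amt, list(toy$party, toy$gender), mean)

# Five-number summaries per group; drop plot = FALSE to draw the box plot
bp <- boxplot(amt ~ party + gender, data = toy, plot = FALSE)
```

`group_means` gives the averages quoted in the text, while `bp$stats` holds the quartiles that the boxes are drawn from.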

## 
## female   male 
##  65569  54083

I did not expect to find more female contributors than males in this data-set.

Let’s explore whether female contributors were more likely to contribute to female candidates.

## [1] 0.546188
## [1] 0.653514

54.6% of the contributions from women in this data-set went to female candidates, while 65.4% of the contributions from men went to male candidates, which is a negligible preference.
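Shares like these are simply the mean of a logical comparison within each contributor gender. A minimal sketch on toy rows, assuming the `contrib_gender` and `cand_gender` columns described earlier:

```r
# Toy rows: one per contribution, with contributor and candidate gender
toy <- data.frame(
  contrib_gender = c("female", "female", "female", "male", "male", "male", "male"),
  cand_gender    = c("Female", "Female", "Male",   "Male", "Male", "Female", "Male")
)

# Share of contributions from women that went to female candidates
is_f <- toy$contrib_gender == "female"
female_to_female <- mean(toy$cand_gender[is_f] == "Female")

# Share of contributions from men that went to male candidates
male_to_male <- mean(toy$cand_gender[!is_f] == "Male")
```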

##    democrat independent  republican           N 
##   80.341880    1.709402   17.948718  117.000000
##   Clinton, Hillary Rodham          Sanders, Bernard 
##               59.54716981               36.50314465 
## Cruz, Rafael Edward 'Ted'          Trump, Donald J. 
##                1.43396226                1.40880503 
##       Carson, Benjamin S.              Rubio, Marco 
##                0.42767296                0.20125786 
##               Stein, Jill                Paul, Rand 
##                0.20125786                0.15094340 
##        Graham, Lindsey O.            Fiorina, Carly 
##                0.05031447                0.02515723 
##             Johnson, Gary           Kasich, John R. 
##                0.02515723                0.02515723 
##                 Bush, Jeb  Christie, Christopher J. 
##                0.00000000                0.00000000 
##      Gilmore, James S III            Huckabee, Mike 
##                0.00000000                0.00000000 
##             Jindal, Bobby          Lessig, Lawrence 
##                0.00000000                0.00000000 
##            McMullin, Evan   O'Malley, Martin Joseph 
##                0.00000000                0.00000000 
##    Perry, James R. (Rick)      Santorum, Richard J. 
##                0.00000000                0.00000000 
##             Walker, Scott     Webb, James Henry Jr. 
##                0.00000000                0.00000000

Around 80% of colleges had a Democratic preference. A majority of the contributions (59.5%) went to Clinton, and Sanders came in second with 36.5%. Cruz came in third (1.43%) and Trump a close fourth (1.41%).
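Percentage tables with a trailing N, like the ones above, can be built with a small helper. `pct_with_n` is a hypothetical name used for this sketch, not necessarily the function that produced the output above.

```r
# Hypothetical helper: percentage of rows in each category, plus the row count N
pct_with_n <- function(x) {
  counts <- table(x)
  c(100 * counts / sum(counts), N = sum(counts))
}

party <- c("democrat", "democrat", "democrat", "republican", "independent")
pct_with_n(party)
```

Keeping N next to the percentages makes it obvious when a large share rests on only a handful of rows.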

Below I will find the statistics for homemakers and retirees.

##   first_name         clean_zip           cmte_id               cand_id   
##  Length:1076        Length:1076        Length:1076        P00003392:600  
##  Class :character   Class :character   Class :character   P60006111:146  
##  Mode  :character   Mode  :character   Mode  :character   P60007168:105  
##                                                           P60005915: 99  
##                                                           P80001571: 84  
##                                                           P60006723: 14  
##                                                           (Other)  : 28  
##                       cand_nm                         contbr_nm  
##  Clinton, Hillary Rodham  :600   BORCH, INGER              : 39  
##  Cruz, Rafael Edward 'Ted':146   GUIDARELLI-AMBRAD, DEBORAH: 35  
##  Sanders, Bernard         :105   FRANK, GLORIA             : 29  
##  Carson, Benjamin S.      : 99   FRANZ, ROBIN              : 29  
##  Trump, Donald J.         : 84   DOVER, RITA               : 28  
##  Rubio, Marco             : 14   BADE, KRISTI              : 27  
##  (Other)                  : 28   (Other)                   :889  
##           contbr_city  contbr_st     contbr_zip      contbr_employer
##  SCOTTSDALE     :212   AZ:1076   857507118: 39   N/A         :531   
##  TUCSON         :171             852533610: 35   HOMEMAKER   :306   
##  PHOENIX        :143             852043820: 29   RETIRED     : 59   
##  GILBERT        : 85             852951792: 29   NONE        : 41   
##  MESA           : 83             853021415: 28   NOT EMPLOYED: 39   
##  PARADISE VALLEY: 57             852543072: 27   MY CHILDREN : 25   
##  (Other)        :325             (Other)  :889   (Other)     : 75   
##                       contbr_occupation contb_receipt_amt
##  HOMEMAKER                     :1028    Min.   : -40.0   
##  UNEMPLOYED - HOMEMAKER        :  25    1st Qu.:  25.0   
##  HOMEMAKER / PHOTOGRAPHER / MSW:   5    Median :  50.0   
##  HOMEMAKER/ACTIVIST/ARTIST     :   5    Mean   : 137.1   
##  HUSBAND/MECHANICWIFE/HOMEMAKER:   5    3rd Qu.: 100.0   
##  HOMEMAKER/PHYSICIAN           :   3    Max.   :2700.0   
##  (Other)                       :   5                     
##   contb_receipt_dt
##  19-OCT-16:  14   
##  03-NOV-16:  12   
##  06-NOV-16:  12   
##  09-OCT-16:  12   
##  26-SEP-16:  12   
##  04-NOV-16:  11   
##  (Other)  :1003   
##                                                            receipt_desc 
##                                                                  :1076  
##  * EARMARKED CONTRIBUTION: SEE BELOW REATTRIBUTION/REFUND PENDING:   0  
##  * REATTRIBUTED FROM EDWARD FARMILANT                            :   0  
##  * REATTRIBUTED TO BARBARA FAMILANT                              :   0  
##  * REATTRIBUTED TO VICTORIA STRONG                               :   0  
##  EVENT PLANNING REATTRIBUTION FROM SPOUSE                        :   0  
##  (Other)                                                         :   0  
##  memo_cd                               memo_text    form_tp   
##   :906                                      :872   SA17A:912  
##  X:170   * EARMARKED CONTRIBUTION: SEE BELOW: 99   SA18 :164  
##          * HILLARY VICTORY FUND             : 98   SB28A:  0  
##          *BEST EFFORTS UPDATE               :  5              
##          *                                  :  1              
##          EARMARKED FROM MAKE DC LISTEN      :  1              
##          (Other)                            :  0              
##     file_num                       tran_id     election_tp
##  Min.   :1014598   C5628470            :   2        :  4  
##  1st Qu.:1077853   A105C04C73FFA4C859DB:   1   G2016:452  
##  Median :1109498   A6BF5A3EFECE4468B9E9:   1   O2016:  1  
##  Mean   :1103419   A85C4E16099CC4E5F8A1:   1   P2016:619  
##  3rd Qu.:1133930   AAA1CD0DBF8AB4B9281D:   1   P2020:  0  
##  Max.   :1146165   AFCCA0974E8D949428D0:   1              
##                    (Other)             :1069              
##   proper_date                 party         city          
##  Min.   :2015-04-01   democrat   :705   Length:1076       
##  1st Qu.:2016-02-27   independent: 14   Class :character  
##  Median :2016-06-21   republican :357   Mode  :character  
##  Mean   :2016-05-23                                       
##  3rd Qu.:2016-09-21                                       
##  Max.   :2016-12-02                                       
##                                                           
##     state              latitude       longitude      cand_gender 
##  Length:1076        Min.   :31.49   Min.   :-114.6   a     :  0  
##  Class :character   1st Qu.:33.30   1st Qu.:-112.1   Female:602  
##  Mode  :character   Median :33.49   Median :-111.9   Male  :474  
##                     Mean   :33.37   Mean   :-111.8               
##                     3rd Qu.:33.62   3rd Qu.:-111.7               
##                     Max.   :36.62   Max.   :-109.4               
##                                                                  
##  contrib_gender
##  female:1035   
##  male  :  41   
##                
##                
##                
##                
## 
##      female        male           N 
##   96.189591    3.810409 1076.000000
##    democrat independent  republican           N 
##   65.520446    1.301115   33.178439 1076.000000
## [1] 35.57178
##    democrat independent  republican        NA's 
## 60.00657639  0.44540101 39.52410845  0.02391415
## [1] 81.05961

About 96% of homemaker contributions came from women, and about 65.5% of homemaker contributions went to Democrats.

As we can see above, retirees make up about 35.6% of the data-set. Around 60% of retiree contributions went to Democrats and around 40% to Republicans; contributions to independents are negligible. Retirees contributed $81 on average.
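The homemaker and retiree figures come from subsetting on occupation. The sketch below uses toy rows. Matching any occupation that contains "HOMEMAKER" and matching "RETIRED" exactly are assumptions about how the subsets were defined.

```r
toy <- data.frame(
  contbr_occupation = c("HOMEMAKER", "UNEMPLOYED - HOMEMAKER", "RETIRED",
                        "RETIRED", "ENGINEER", "RETIRED"),
  contb_receipt_amt = c(50, 25, 100, 60, 250, 80),
  stringsAsFactors = FALSE
)

# Match any occupation that mentions HOMEMAKER, including compound entries
homemakers <- toy[grepl("HOMEMAKER", toy$contbr_occupation, fixed = TRUE), ]

# Retirees: exact match on the occupation label
retired <- toy[toy$contbr_occupation == "RETIRED", ]

retired_share <- 100 * nrow(retired) / nrow(toy)   # percent of all contributions
retired_mean  <- mean(retired$contb_receipt_amt)   # average contribution amount
```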

I want to know which occupations are most politically active, and how they lean politically.

##                        [,1]
## RETIRED               32749
## NOT EMPLOYED          13737
## INFORMATION REQUESTED  3214
## ATTORNEY               2107
## PHYSICIAN              1897
## TEACHER                1821
## ENGINEER               1389
## CONSULTANT             1272
## PROFESSOR              1239
## SALES                  1238

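Setting aside retirees, the not employed, and entries with no occupation given ("INFORMATION REQUESTED"), the most politically active occupations in the data set are attorneys, physicians, then teachers.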
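A ranking like this is a sorted frequency table, optionally with the non-job labels filtered out. A minimal sketch on toy data, where `not_jobs` is a hypothetical exclusion list:

```r
occupation <- c("RETIRED", "RETIRED", "RETIRED", "NOT EMPLOYED", "NOT EMPLOYED",
                "INFORMATION REQUESTED", "ATTORNEY", "ATTORNEY", "PHYSICIAN")

# Contributions per occupation, most frequent first
occ_counts <- sort(table(occupation), decreasing = TRUE)

# Labels that do not describe a working occupation
not_jobs <- c("RETIRED", "NOT EMPLOYED", "INFORMATION REQUESTED")
working_counts <- occ_counts[!names(occ_counts) %in% not_jobs]
```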

Below I would like to know the party-leaning proportions for each job. For example, of all engineers, what percentage lean Republican (number of Republican engineers / total number of engineers)?
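The ratio described here is a row-wise proportion of an occupation-by-party table. A small sketch with toy rows, assuming column names `occupation` and `party`:

```r
toy <- data.frame(
  occupation = c("ENGINEER", "ENGINEER", "ENGINEER", "ENGINEER", "TEACHER", "TEACHER"),
  party      = c("republican", "republican", "republican", "democrat", "democrat", "democrat")
)

# Rows are occupations, columns are parties
counts <- table(toy$occupation, toy$party)

# margin = 1 divides each row by its total, giving the party split within each job
party_share <- prop.table(counts, margin = 1)
party_share["ENGINEER", "republican"]
```

In this toy table three of the four engineer rows are Republican, so the share is 3 / 4.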

I would like to see the average contribution amount over time.

The average contribution amount looks huge at the beginning of 2015, but when I added a fourth variable (n = number of contributions) it became clear that this was driven by a few outliers. The bulk of the contributions came in mid-2016, when the average was lower but the number of contributions (n) was substantially larger.
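The effect of a few early outliers on the average can be checked by computing the mean and the count side by side for each month. The sketch below uses toy dates in the file's DD-MON-YY format. The parsing approach and all variable names are assumptions, not the original chunk.

```r
# Receipt dates in the file's DD-MON-YY format, with toy amounts
dt  <- c("05-APR-15", "19-OCT-16", "26-SEP-16", "03-NOV-16", "09-OCT-16")
amt <- c(2700, 25, 50, 15, 35)

# Parse without relying on the locale: look the month up in month.abb
parts <- do.call(rbind, strsplit(dt, "-", fixed = TRUE))
month_num <- match(parts[, 2], toupper(month.abb))
proper_date <- as.Date(sprintf("20%s-%02d-%s", parts[, 3], month_num, parts[, 1]))

# Average amount and number of contributions per calendar month
month <- format(proper_date, "%Y-%m")
mean_amt <- tapply(amt, month, mean)
n <- tapply(amt, month, length)
by_month <- data.frame(month = names(mean_amt),
                       mean_amt = as.vector(mean_amt),
                       n = as.vector(n))
```

In the toy result the 2015-04 row has a very high mean but n = 1, which is the pattern described in the text.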

Note: there is a discrepancy in the color scale:

Brainstorming: What can I do to improve?

What kind of graphs could I add? - Bar chart of the jobs contributing most to the Donald

I am wondering what kinds of jobs contributed to Donald Trump. My intuition says it’s mostly blue-collar jobs. Let’s find out!

##  [1] Trump, Donald J.          Sanders, Bernard         
##  [3] Cruz, Rafael Edward 'Ted' Clinton, Hillary Rodham  
##  [5] Stein, Jill               Carson, Benjamin S.      
##  [7] Paul, Rand                Fiorina, Carly           
##  [9] Rubio, Marco              Johnson, Gary            
## [11] Bush, Jeb                 Kasich, John R.          
## [13] Santorum, Richard J.      McMullin, Evan           
## [15] Webb, James Henry Jr.     Huckabee, Mike           
## [17] Walker, Scott             Christie, Christopher J. 
## [19] Jindal, Bobby             O'Malley, Martin Joseph  
## [21] Lessig, Lawrence          Graham, Lindsey O.       
## [23] Perry, James R. (Rick)   
## 24 Levels: Bush, Jeb Carson, Benjamin S. ... Webb, James Henry Jr.

My hypothesis was wrong: most of Trump’s contributors have white-collar jobs, including higher-income occupations such as engineers, consultants, physicians, and CEOs. One weakness of this plot is that it does not represent low-income contributors.

Let me look at the number of contributions only, to see if it helps me find out more.

## [1] 0.1344482

Trump received about 13.4% of all contributions in the data-set (16,087 of 119,652).

Even after changing some of the subset filters, the majority were still high-income occupations. Some blue-collar jobs such as truck driver and construction do appear, but they are a minority. My new hypothesis is that blue-collar workers cannot afford to contribute and are therefore underrepresented in this data-set.

I want to see if higher-income zip-codes had more contributions, and I will use a scatter-plot to demonstrate. A strong relationship appears only when I subset the data to zip-codes with at least 100 contributions. Without that filter, the data is skewed and the relationship is not apparent.
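The filtering step can be sketched as follows. The ZIP-level table, the `median_income` column, and every value in it are hypothetical placeholders for illustration. Only the threshold of 100 contributions comes from the analysis.

```r
# Toy ZIP-level table: contribution counts and a hypothetical median_income column
zip_stats <- data.frame(
  clean_zip     = c("85001", "85002", "85003", "85004", "85005"),
  n_contrib     = c(12, 150, 420, 95, 800),
  median_income = c(90000, 45000, 60000, 30000, 85000)
)

# Keep ZIP codes with at least 100 contributions
busy_zips <- subset(zip_stats, n_contrib >= 100)

# Pearson correlation between income and number of contributions
r <- cor(busy_zips$median_income, busy_zips$n_contrib)
```

Comparing `r` with the same correlation on the unfiltered table shows how much the low-count ZIP codes change the picture.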

This is obvious but still I would like to see the relationship between number of contributions and population of zip-code.

There is a strong correlation at first, but then as population increases the relationship weakens.

Final Plots:

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]

Reflection:

Overall, this project was a good challenge and learning experience. At first, exploring the data was easy and enjoyable. As I went deeper into the analysis, it became harder to come up with relationships and conclusions about the data. I wanted my analysis to have a central theme or thesis, and not being able to draw a definite conclusion left me frustrated.

I was impressed with the versatility of R and its packages. It felt more intuitive than Python, maybe because I have a background with Alteryx. R seemed to have less support on Stack Overflow than Python, but the support that exists aided me significantly throughout the project. I also used DataCamp to fill knowledge gaps and reinforce the concepts learned in the Udacity curriculum. I did not use Udacity’s live help as much as in the other projects, because my problems were not with programming itself but with a lack of ideas and direction for my analysis.

In terms of visualizations, R is fantastic for data exploration, although I found it lacking when exporting high-resolution plots. I feel that Tableau is more suitable for final, conclusive plots.